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Abstract 

A statistical model of protein families, called profile con- 
ditional random fields (CRFs), is proposed. This model 
may be regarded as an integration of the profile hidden 
Markov model (HMM) and the Finkelstein-Reva (FR) the- 
ory of protein folding. While the model structure of the 
profile CRF is almost identical to the profile HMM, it can 
incorporate arbitrary correlations in the sequences to be 
aligned to the model. In addition, like in the FR theory, the 
profile CRF can incorporate long-range pairwise interac- 
tions between model states via mean-field-like approxima- 
tions. We give the detailed formulation of the model, self- 
consistent approximations for treating long-range interac- 
tions, and algorithms for computing partition functions and 
marginal probabilities. We also outline the methods for 
the global optimization of model parameters as well as a 
Bayesian framework for parameter learning and selection 
of optimal alignments. 

Introduction 

Protein sequence alignment is one of the most fundamental 
techniques in biological research. Since the early methods 
have been proposecPI^E, techniques for protein sequence 
alignment have made a huge progress toward the detec- 
tion of very weak homology^E Today, most advanced 
methods incorporate some kind of information obtained 
from multiple sequence alignments in terms of sequence 
profiles^ or position-specific scoring matrices (PSSM). In 



sequence profiles, such as used in PSTBLAST-, scores for 
amino acid substitutions are made to be position-specific 
so that subtle evolutionary signals can be embedded in 
each site^. This in turn makes homology search more sen- 
sitive. Profile hidden Markov models (HMMj^H further 
elaborate the sequence profile methods so that deletions 
and insertions are also made position-specific. Although 
powerful, these methods do have some limitations. The 
profile methods (including profile HMMs) assume that 
each position in a profile is independent of other positions 
which makes it difficult to incorporate long-range correla- 
tion among different sites. The importance of long-range 
correlations is evident when one takes into account the 
tertiary structure of a protein in which residues far apart 
along the sequence are in contact to define the specific 
native structure. In practice, one can supplement a plain 
sequence profile with some structural information as in 
three-dimensional (3D) profiled or threading^, but such 
combined approaches remain inherently ad hoc. In case of 
profile HMMs, it is extremely difficult, if not impossible, 
to employ such an approach since the inclusion of site-site 
correlations, both short-range and long-range, may break 
the probabilistic framework of the model. 

In order to incorporate long-range correlations into an 
HMM-like model in a well-defined manner, we present 
in this paper the theoretical formulation of a model based 
on conditional random fields (crf;P. Various CRF-based 
models have been successfully applied to many problems 
in biological domains including pairwise protein sequence 
alignment 1 ^ gene predictiorPl and protein conformation 
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sampling! to name a few. CRFs share many of the advan- 
tages of HMMs while being able to handle site-site corre- 
lations. In the context of profile CRFs, we need to distin- 
guish two types of site-site correlations. One is the corre- 
lations within the sequence which is to be aligned to a CRF 
model; the other is those among the sites within the model. 
The profile CRF model proposed in this paper has no lim- 
itations for incorporating the both types of correlations, 
although some approximations are necessary for the lat- 
ter type in practical applications. Without the model sites 
correlations, the profile CRF model may be regarded as a 
generalization of the profile HMM. Unlike profile HMMs, 
profile CRFs can incorporate many kinds features of the 
sequence in terms of feature functions. With the model 
sites correlations, the profile CRF model may be regarded 
as a generalization of the self-consistent molecular field 
theory of Finkelstein and Reva^^S^, which, in turn, is a 
generalization of the Ising model in one-dimension (ID). 

In this paper, we first present the model structure of 
the profile CRF, provide some examples of possible fea- 
ture functions, and derive some approximations for treat- 
ing long-range correlations between model sites. Next, 
we present algorithms for computing partition functions, 
marginal probabilities, and optimal alignments, followed 
by methods for parameter learning based on multiple se- 
quence alignments. Since our purpose here is to present 
the formulation and algorithms, actual implementation of 
the method and experimentation thereof are left for future 
studies. Nevertheless, we believe that the method pre- 
sented here should serve as a firm basis for the analysis 
of protein sequences and structures. 

Theory 

Profile conditional random field model 

We model a protein family (or a multiple sequence align- 
ment) in an analogous manner as profile HMMs^E (Fig. 
[T). A profile CRF model M. is formally defined as a tuple 
of four components: 

M. = {M, S, T , 6) (1) 

where M is the length of the model M, S = 
{Mk, Ik, Dk} is a set of states indexed by model sites 
k = 0, 1, • • • , M, M + 1. For each site k (1 < k < M), 




1 ... k ... M M+1 



Figure 1: The model structure of a profile conditional 
random field (CRF). Squares, diamonds, and circles are 
matching, insertion, and deletion states, respectively. The 
start and end states are labeled with "S" and "E" in the 
squares. 

there are a matching state Mk, an insertion state Ik, and a 
deletion state Dk- For k = 0, there are only one matching 
state AIq and one insertion state Iq\ for k = M+1, there is 
only one matching state Mm+i ■ The matching states at the 
termini Mo and Mm+i are also called start state and end 
state, respectively, for the reason that will be apparent in 
the following. The model sites with k = 1, ■ • ■ , M may be 
regarded as the core sites of the protein family. The third 
component, T, is a set of feature functions which are asso- 
ciated with model states (S). Each feature function maps 
an amino acid sequence and its site indexes to a real num- 
ber depending on model sites. The last component, 9, is a 
set of parameters or external fields, each of which is asso- 
ciated with a feature function in T. Together with feature 
functions, the external fields are used for evaluating align- 
ments between the model and amino acid sequences. The 
details of these terms will be clarified below. In a profile 
CRF model, the feature functions must be given a priori 
and the values of external fields are learned from a multi- 
ple sequence alignment (MSA). 

The objective is to align a protein sequence x = 
X\Xi ■ ■ ■ xl (called target sequence) to the model. An 
alignment between a target sequence x and a CRF model 
is an ordered sequence of pairs of target sites and model 
states (called site-state pairs in the following): 

A = {(0, Mo), (1, yi), ••-,(», yi ), ■ ■ ■ , (L + 1, M M +i)} 

(2) 

where j/j = S k G {M k , Ik, D k }k=o, - M+i- The pair 
{i,yi) reads as "the target site i is matched to the model 
state yi" It is assumed that if i < j and y^ = Sk and 
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Table 1: Allowed transitions of site-state pairs, i and k 
indicate a site of a target sequence and a site of a CRF 
model, respectively. 



(i, Affc) - 


> (i + l,Mfc+i) 


(i, Mfc) - 


» (i + l,/*) 


(i, Affc) - 


> (*,Afe+i) 




> (i + l,M fe+ i) 


(i,ffe) - 




(*)4) 






* (i + l,Mfc+i) 


(«,-Dfe) - 


» (i + l,J*) 


(i,-Dfe) 





functions Sg(x,i), doublet feature functions tg s ,(x, i) 





S,S' 

and pairwise feature functions ws j s/(x, The singlet 
feature function (SFF) Sg(x, i) is a real-valued function 
representing some feature a of the target sequence when 
Di = S; the doublet feature function (DFF) tg s , (x, i) 
is also a real-valued function representing some feature (3 
when yi- = S and yi = S'. Here, i~ is the predecessor 
of i defined as 
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(if»i = D fc ), 

(if yi = Mfcor/fc). 



(5) 



2/j = S*(, then k < I. An alignment always starts at the start 
state and ends at the end state so that the pairs (0, Mo) and 
(L + 1, Mjvf+i) are fixed in any alignments. In an align- 
ment, not all transitions from one site-state pair to another 
are possible. Allowed transitions are listed in Table[T]and 
depicted in Fig. [T|by arrows. By convention, the match 
to a delete state (i, £)&) means that the deletion resides be- 
tween the i-th and (i + l)-th positions of the sequence. For 
example, an alignment of an 8-residue target sequence to 
an M = 7 profile CRF model might be given as 

A= (x,y) = 

{(0, M ), (1, Jo), (2, Mi), (3, M 2 ), (3, D 3 ), (4, M 4 ), 
(5, M 5 ), (6, h), (7, 7 5 ), (8, M 6 ), (8, D 7 ), (9, M 8 )}. (3) 

As can be inferred from this example, (i, Mk) indicates 
that the i-th residue matches the fc-th core site of the model, 
(i,Ik) indicates that there is an insertion at the i-th site 
in the target sequence, and (i, Dk) indicates that there is 
a deletion between i-th and (i + l)-th sites of the target 
sequence. In terms of ordinary sequence alignment, the 
alignment in Eq. ® may be expressed as 

- Mi M 2 M 3 M 4 M s - - M 6 M 7 
x\ X2 — Xi a;5 xq xr x% — 

(4) 

where the ' — ' signs in the upper and lower rows indicate 
insertions (corresponding to 1^) and deletions (Dk) in the 
model sites. 

Alignments are evaluated in terms of a set of feature 
functions T = {sg, tg s ,, iig s ,}. Three types of fea- 
ture functions are distinguished, namely, singlet feature 



In general, a may depend on S and f3 may depend on S and 
S'. The singlet and doublet feature functions are called 
local or short-ranged since the former represents interac- 
tions at one model site and the latter, interactions between 
two adjacent model sites for which transitions are allowed. 
The pairwise feature function (PFF) u s s ,(x,i,j), repre- 
senting some feature 7, is defined for yi = S and yj = S'. 
While singlet and doublet feature functions are local, pair- 
wise feature functions are non-local in the sense that S and 
S' can be any pair of the model states, not necessarily those 
for which direct transitions are allowed. 

Each of singlet, doublet or pairwise feature functions 
is coupled with a parameter called an external field: 

for Sg, fig s , for t s s ,, and s , for u s s ,. That is, 

9 = {Xg, fig s ,, Vg s ,}. The product of a feature function 
and its coupled external field yields the score of the cor- 
responding feature when a particular target site is aligned 
to a model state. For example, the product A§Sg(x, i) is 
the score of the feature a when the target site i is aligned 
to the model site S. In the formulation of CRF, it is con- 
venient to employ an analogy to statistical physics. Thus, 
the negative total score of an alignment is interpreted as 
the total energy, and the normalization factor for the con- 
ditional probability of alignments as the partition function 
of the target sequence. 

Given an alignment between the model and the se- 
quence, the total energy of an alignment A = (x, y) = 
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{(0,M ),--- ,(i,yi),--- ,(L+1,M M+1 )} is defined by 



for the doublet terms, 



£(y,x, 

-E 



E«( x ^)+E 



A*y._ ,yi^y.- ,yi ( X > *) 



- E E "vuvAv, ( X >*'J) ,6) 

{i<j} 7 

where the summation over {i} means summing along the 
alignment (x, y) (there can be multiple occurrences of the 
same index i due to the matching to deletion states); the 
double summation for i < j is also similarly defined. The 
partition function of this system is thus given by 



Z(x,0) = ^exp[-£(y,x,0)/T] 
M 



(7) 



where the summation is over all possible alignments, and 
T is the temperature (in energy unit). The conditional 
probability of obtaining a particular alignment A = (x, y) 
for a given x is 



P(y\x,o) = 



cxp[-£;(y,x,0)/r] 
Z(x,0) 



(8) 



which is also called the likelihood of the alignment in the 
following. The log-likelihood is defined by 



dL(0\x,y) _ 

XXs'(M)[<fe, B4 -*S',« - P(S,S'\x,i)] (11) 

where P(S, S'\x, i) is the marginal probability that y,- = 
S and yi = S'. Finally for the pairwise terms, 



dL(0\x,y) _ 
d "s,s> 

u} tSf (x,iJ)[Ss !y .8 s > tyj 

{i<j} 



P(S,S'\x,i,j)] (12) 



where P(S, S' |x, i,j) is the marginal probability that yi = 
S and yj = S'. Either when parameters are optimal for a 
given alignment or when the alignment is optimal for given 
parameters, we have dL/d9 = 0. 

Feature functions 

Although we are focused on the formulation of the pro- 
file CRT model, it is instructive to provide some concrete 
examples for feature functions. It should be stressed, how- 
ever, that the actual selection of feature functions will re- 
quire careful experimentation to maximize the effective- 
ness of the profile CRF framework. 



L(0|x,y) = logP(y|x,0) 

= -£7(y,M)/T-logZ(M). (9) 

From here on, we assume T = 1 unless otherwise stated. 
The derivatives of the log-likelihood with respect to the 
parameters, dL/86, are useful both for parameter learning 
and for deriving approximations. For singlet terms, they 
are given as 



dL(0\x,y) 
d\« s 



9%(x,i)[6 s , Vi -P(S\x,i)] (10) 



where 5s, Vi is Kronecker's delta and P(S\x,i) is the 
marginal probability that i-th site of the target sequence is 
aligned to the state S of the model, i.e., yi = S. Similarly 



Singlet feature functions. Singlet feature functions rep- 
resent compatibility measures between a model state and 
a target sequence. It may depend on the whole target se- 
quence as well as on single amino acid residues. One sim- 
ple SFF may be such that 



(13) 



where R is one of the 20 standard amino acid residue 
types. It is implicitly assumed that this feature function 
is defined only when yi = M^. The same assumption ap- 
plies throughout the following discussion. 

If the target sequence is accompanied by its PSSM, the 
above SFF (Eq. [T3l can be generalized as 



PSSM(-R) 



(x,0 = PSSM(i,i?) 



(14) 
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where PSSM(j, R) is the value of the PSSM for residue 
type R at site i. 

SFFs can also depend on multiple sites of the target se- 
quence. For example, let us partition amino acid residues 
into either hydrophobic (1) or hydrophilic (0). Let 67 (x, i) 
be a binary word encoding 19 function of the 7-residue sub- 
sequence x,_3 • ■ ■ x, + 3. Then, the SFF 



,,0000000/ 



J 0000000,6 7 (x,i) 



(15) 



may enhance insertions at highly hydrophilic regions of 
the target sequence. Similarly, the SFF 



0011011 



(x,«) 



30011011,i> 7 (x,i) 



(16) 



may enhance the matching at a-helical regions since the 
binary pattern 0011011 is typical in a helices. There are 
2 7 = 128 types of binary words for 7-residue segments, 
and we can incorporate all of these in a single profile CRF 
model. 

If either predicted or observed structural information is 
available for the target sequence, we may define, for ex- 
ample, 



>H,SS(i) 



(17) 



where SS(i) indicates the secondary structure of site i. 

Doublet feature functions. Doublet feature functions 
represent the feasibility of transitions from one site-state 
pair to another. One trivial example is those that do not 
depend on the target sequence at all. For example, the DFF 



Pairwise feature functions. With pairwise feature func- 
tions, it is possible to incorporate some kind of correlations 
between two states that are not directly connected by tran- 
sitions. Such correlations are most easily grasped in the 
context of the tertiary structure of a protein. Suppose that 
there is a known structure in a protein family to be mod- 
eled as a profile CRF, and that structure contains a pair 
of contacting residues which correspond to the matching 
states Mk and Mi. We may define 



contact(R,R ) 
l M k ,Mi 



CM, 3) 



iSx 4 



(20) 



where R and R' are amino acid residue types. We can de- 
fine different PFFs for different kinds of interactions such 
as hydrogen bonds, salt bridges, hydrophobic contacts, etc. 
Also, sequence-dependence may be made more complex. 
We can combine contacts with binary word encoding, for 
example. 

Approximations for pairwise interactions 

If there are no pairwise terms, exact partition functions and 
exact optimal alignments for profile CRF models can be 
computed efficiently by dynamic programming just as in 
profile HMMs. With pairwise terms present, however, the 
computation of exact solutions is intractable. In order to 
make computations feasible, we need to make some ap- 
proximations. More specifically, we will derive a Bethe 
approximation, which is further simplified to a mean-field 
approximation. 

Observe, first, that the pairwise terms can be rearranged 

as 



*M h ,/ fc (M) = 1 



(18) 



it 7 



CM, 3) 



E J2< 

{i<j} 1 

E EE"2,s' B s 1 s'( x i*)i)Ws',w ( 2l ) 

{i<j} 7 S,S' 

Of course, DFFs can be made to be target sequence- When the alignment is optimal, we have 

(Eq. H2l) . hence the follow- 
ing: 



may be regarded as a feature representing a gap (insertion) 
opening. Similar sequence-independent DFFs can be de- 
fined for all the allowed state transitions. 



dependent. Take the binary word encoding example again. dL(9\x, y) /dvg s 



For example, the following DFF 

.001101 / -\ _ x 

t AI k ,D k + 1 ( x ^) — 0001101, f>e(x,i) 



(19) 



may help to suppress deletions at a-helical regions, since 
the pattern 001101 is typical in a-helices in which dele- 
tions are less likely to occur. 



"s,s'(M, j)8s, yi Ss<, yj = 

{i<j} 

J2 u5 iS ,(x,i,j)P(5,5'|x > i I i). (22) 

{i<j} 
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Using this relation, the pairwise terms are arranged as is optimal. For non-optimal alignments, we approximate 

the total energy by 

E v s,s' u s,S'( x ^hj) S s^S', yj = E(y,*-,0) «-E(y,x,0) (29) 

{i<j} 7 ,S,S' 

^^^^ with Vg = 1, The renormalized singlet feature function 

u s( x ' O-P^l^ *) ( 23 ) (Eq. [28} explicitly accounts forthe pairwise joint probabil- 

{*J 7 ' S ity, and hence it may be called a Bethe or quasi-chemical 

approximation. Furthermore, if we assume two alignment 
where ni x, i) is the renormalized singlet feature function .. . , , . , , . . . . , 

i> y 1 ' b sites are independent, we can decouple the joint marginal 

defined by , 

J probability as 

u5(x,i) = P(S,S'\x,i,j) « P(S\x,i)P(S'\x,j). (30) 

-YY vZ s ,ul s ,( X ,i,j)P(S'\S,x,i,j). (24) This isamean-field approximation. Substituting this into 
2 *—f ' ' Eqs. (125128b . we have the following mean-field energy: 

The conditional marginal probability P(S'\S, x, i,j) is de- u s( x ^) ~ ^ S E ^?,S' U 1,,S'( X > l > j) P ( S '\ x > J)- 
fined by s ' 

(31) 

P(5*, S"|x, i, j) An advantage of this approximation is that we need not to 

P(S'\S,x,i,j) = — p( S \ x \j ■ ^ 25 ) compute the joint marginal probabilities. By using either 

the Bethe (Eq. l24l or the mean-field (Eq. [3D approxima- 
ting if s (x, i) and introducing a coupled external field iP s , tions > the ener gY of ^ alignment is expressed as 
let us define a tentative total energy: 



{'} 



' Y, Mtt.— ,Vjtyi- ,Vi ( x ' *) 




a 7 



(26) " 



(32) 



Note that there are apparently no external field parame- 
ters for the renormalized SFFs (uj(-)); they are included 
By calculating the log-likelihood (Eq. [9) based on this en- in the definitions (Eqs. [24] [3TJ|. Since the mean-field fea- 
ergy and its derivative with respect to z/ 7 . (Eq. [TOj, and ture functions are effectively singlet feature functions, we 
enforcing the optimality condition dL(9\x,y) jdv^ = 0, can apply the standard procedure for learning and align- 
we have ment, provided that the mean-fields are known. Of course, 

the mean-fields are not known in advance so that we need 
to obtain the partition function by an iterative procedure. 
W W That is, 

Substituting this relation into Eq. (23), we have 1 ■ Arbitrarily set u} (•) . 

_^ _^ _^ _^ 2. Calculate the partition function and marginal proba- 

/ j / j v 1i,yj^li,Vj ( X ' h 3) = E ( x > *) (28) bilities based on the previously calculated Mg(-). 

{i<j} 7 {i} 7 

3. Based on the partition function and marginal proba- 
Therefore, the pairwise energy terms can be converted into bilities in the previous step, update Ug(-) by Eq. (24) 

renormalized singlet energy terms as long as the alignment or Eq. (3D . 
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4. Iterate steps 2 and 3 until convergence. 

The algorithms for computing the partition function and 
marginal probabilities are a subject of the next section. 

Algorithms for alignment and learn- 
ing 

Computation of partition function, marginal 
probabilities and optimal alignment 

The partition function (Eq. |7]l and marginal probabilities 
can be calculated efficiently by dynamic programming (or 
transfer matrix method). In this section, we assume that 
pairwise terms are approximated as renormalized SFFs 
(Eqs. l24l[3T1 l, and they are treated as ordinary SFFs. First 
we define the transfer matrix: 

T i (S,S')=exp[e i (S,S')/T\ (33) 

where S', S G S, T is the temperature, and 

ei{S,S') = Y, x s>ss>M + Y,&s-4,s>M- W 

a 

The partition function (Eq. [7]) is then expressed as 

^( x ) = EII Ti ^-'^) (35) 

M {i} 

where the summation is over all possible model states of 
each residue of the target sequence. In order to com- 
pute the partition function Eq. Q51 >. we define an auxil- 
iary function Zij(S k , Si) where i, j = 0, • • ■ , L + 1 and 
S k G {M k ,I k ,D k },Si G {Mj.ii.D,}. Z i>:j (S k ,Si)hthe 
partition function of the sub-sequence XiXi+i ■ ■ ■ Xj where 
its termini i and j are fixed to the model states S k and Si, 
respectively. These conditions are given as 

Z iti (S k ,S) = 6 Sk ,s, (36) 
Z jd (S,S t ) = Ss,s r (37) 

By the construction of the model, the following boundary 
conditions hold in particular: 

Z 0t0 (M ,M ) = 1, (38) 
Z L+l>L+1 (M M+1 ,M M+1 ) = 1. (39) 



The partition function Z(x) is given as 

Z(x) = Z , L+1 (M , M M +i). (40) 

Based on the boundary condition Eq. d36*l l. the follow- 
ing forward recurrence equations for Zi j (S k ,Mi) hold for 
j =«',••■ , L + 1 and / = k, ■ ■ ■ ,M + 1: 

Z h3 (S kl Mi) = Zij-^S^Mi-^TjiMi-^Mi) 
+ Z iij _ 1 {S k ,Ii. 1 )T j (I l _ 1 ,Mi) 

+ Zij^Sk, A-i)r 3 -(A-i,M,); (41) 

Z itj (S k> Ii) = Z i , j - 1 {S k ,M l )T j {M u I l ) 
+ Z liJ _ 1 {S k ,I l )T J (Ii,Ii) 

+ Z iij _ 1 (S k ,D l )T j (D l ,I l ); (42) 

Z itj (S k ,Di) = ZijiS^Mi^T^Mi-uDi) 
+ Z iij (S k ,Ii- 1 )T j (I l _ 1> D l ) 

+ Z hj (S k ,Di- 1 )T j (Di- 1 ,Di). (43) 

It is understood that the terms involving non-existent 
states and/or incompatible state transitions (e.g, 
Zi,i(Mo,Dq), Zi,q(Io,Io), etc.) are ignored. Simi- 
larly, together with the boundary condition Eq. d37l i. the 
backward recurrence equations are given for i = j, ■ • ■ , 
and k = I, ■ ■ ■ , as 

Zi,j(M k , Si) = T i+1 {M k ,M k+1 )Z i+ltj (M k+1 ,Si) 
+ T i+1 (M k ,I k )Z i+hj (I k ,Si) 

+ Ti{M k ,D k+l )Z itj {D k+u Si); (44) 

Zi,j(I k ,Si) = T i+ i(I k , M k+ i)Z i+ ij(M k+ i, Si) 
+ T i+ i(I k ,I k )Z i+ ij(I k ,Si) 

+ T i {I k ,D k+1 )Z itj {D k+u S l ); (45) 

Z itj (D k , Si) = T i+1 (D k ,M k+1 )Z i+hj (M k+1 ,Si) 
+ T i+ i(D k , I k )Z i+ ij(I k , Si) 

+ T t {D k ,D k+1 )Z^{D k+1 ,Si). (46) 
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For convenience, let us define the forward auxiliary func- 
tion Fi(Sk) and the backward auxiliary function Bi(S k ) 
by 

Fi(S k ) = Z 0ti (M ,S k ), (47) 
Bi{S k ) = Z itL+1 {S k ,M M+1 ). (48) 

Using Fi and Bi, and Z^j, we can calculate marginal prob- 
abilities. The joint marginal probability is obtained as 

P(S k , Sl \x,iJ) = W^frWBM) , (49) 

Z(x) 

In particular, when i = j and S k = Si, we have 

Fi{S k )Bi(S k ) 



P{S k \x,i) 



Z(x) 



(50) 



Similarly, for states with allowed transitions 5 and 5' (Ta- 
ble Q), 

P(StSfM= ^m^m. ( 5i) 

Using these marginal probabilities, the renormalized SFFs 
for pairwise terms (Eqs. l24l[3T1 l can be computed. 

The optimal alignment for a given model and a target se- 
quence is the one that yields the minimum energy, which 
corresponds to the free energy of the system at zero tem- 
perature (T = 0). The recurrence equations for the opti- 
mal alignment can be derived as the zero-temperature limit 
of the forward recurrence equations using the following 
formula^: 

lim elog[Ve ai/e l = max a*. (52) 

' i 



That is, if we define a function 

MS k ) = lim [T log Fi(S k )], 



T->0 



(53) 



the energy of the optimal alignment A = (x, y op t) is given 
by 

E(y opt ,x) = -A L+l (M M +i). (54) 
More concretely, we first set the boundary condition 

A (M ) = 0, (55) 



and apply the zero-temperature limit to the both sides 
of the forward recurrence equations for Fi(S k ) = 
Z 0! i(M , S k ) (Eqs. EBBS, we have 

Ai(M k ) = max{[Ai-i(Affc_i) + ei(Af fc _i, M k )], 
[Ai^ih^) + ei{I k -x,M k )], 

[Ai-\(I k -i) + e,(Dfe_i,Mfc)]}; (56) 

Ai{I k ) = max{[A j _i(M Jk ) + tti{M k ,I k )], 
[Ai-i(I k ) + ei{I k ,I k )\, 

[A^iD^ + e^D^h)}}; (57) 

Ai(D k ) = taax{[A j (Affc_i) + e, t (M k -i, D k )}, 
[Mh-^ + eiih-uDk)}, 

[^(Dfc-O + eiCDfc-i.Dfc)]}. (58) 

By tracing back the site-state pairs that yield the optimal 
values of Ai(S k ) at each step, we can find the optimal 
alignment y opt . 

Parameter learning with multiple sequence 
alignment 

Global optimization of parameters. The parameters of 
a profile CRF are the set of external fields A§, fig, s and 
Vg s , (of course, we need to specify the feature functions 
to start to with). The input for parameter learning is a 
multiple sequence alignment (MSA) of a protein family, 
from which the model architecture must be somehow spec- 
ified "by hand." In this process, we need to specify which 
columns of the MSA correspond to matching states. After 
the columns of matching states are determined, matching, 
insertion and deletion states can be assigned to each col- 
umn of each sequence in the MSA. 

After the model architecture has been determined, the 
learning can be done by maximizing the likelihood using 
the sequences of the input MSA. Let these alignments be 
(x' p ) , y^) where p = 1, ■ • ■ , n is the index of sequences. 
The joint log-likelihood is given by 



L(0|{xW yW}) = 

n 

[£(y (p) ,x( p \6>) + logZ(x 



(p) 



fc=i 



(59) 
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Since the total energy is a linear function of the parame- 
ters, and — log Z is the free energy of the system which 
is always convex, the total log-likelihood is also a convex 
function of the parameters. This implies that we can ob- 
tain the globally optimal parameter sets by gradient-based 
methods. In practice, minimizing the bare log-likelihood 
may results in over-fitting of the parameters to the training 
set. Therefore, we define an alternative objective function 
#(0|{xW,yW}) which includes prior probability den- 
sity of the parameters for regularization: 



Suboptimal parameters may be obtained by Markov 
chain Monte Carlo simulations in the 0-space, using 
— log/\ (6\{x( p \ y' p -*}) as the "energy" of the system. 
Since the gradients of the log-likelihood can be computed, 
a hybrid Monte Carlo method is also at our disposal for 
efficient sampling. 

We can also employ hierarchical Bayes learning which 
can automatically adjust the the hyper-parameters for the 
prior, <jg and <To s ,, based on the training sePR 



A-(0|{x« yW}) 



L(0|{x«yW}) 



io g p(0) (60) Discussion 



where P(9) is a Gaussian prior: 



P(8) = ~[[ex P 



a,S 



(A§) 2 



2(*g) a 



x [ cxp 

y,S,S' 



IJ exp 

P,S',S 



13 \2 
S,S'J 



2(*3,S') 2 



(61) 



Here, the hyper-parameters Ug, Og, s and al s , are the 
(expected) standard deviations of the corresponding exter- 
nal fields, and must be specified a priori (however, if we 
use a hierarchical Bayes model, these hyper-parameters 
can be automatically adjusted based on the training data). 
Since we can calculate the gradient of this log-likelihood, 
it is possible to use gradient-based optimization tech- 
niques. Since K(6\{x&\ y&>}) (Eq. |5U) is still convex, 
the globally optimal parameters are guaranteed to be found 
by gradient descent methods. 

Bayesian learning. It is also possible to apply the 
Bayesian learning framework 21 . That is, instead of using a 
single, globally optimal parameter set, we can use a set of 
suboptimal parameters to make robust predictions. From 
Bayes' formula, we have 

P(0|{x( p \yW}) c<exp[L(0|{x (p \yW})]P(6O. (62) 

Using this equation, a Bayesian alignment for the target se- 
quence x may be selected so as to maximize the following 
probability: 

P(y|x,{xW yW}) = /p(y|x,0)P(0|{xW yW})d0. 

(63) 



In this paper, we have formulated the profile CRF to 
model protein families with possible long-range correla- 
tions such as structural information. The profile CRF 
model is clearly an extension of both the molecular field 
theory of Finkelstein and Reva (FR theory jlMHSM and the 
profile HMM^H and hence an integration of these. Here, 
we shall discuss the relationship of the present model with 
these two earlier models. 

The FR theory is particularly focused on 3D structures 
of proteins. Accordingly, its model is explicitly repre- 
sented in the 3D space as a set of lattice points. The lattice 
points mostly correspond to residues in secondary struc- 
ture elements (SSEs), and these points may be regarded as 
"match" states in the present framework. The FR model 
does not allow gaps within each SSE, only insertions are 
allowed in the regions between two SSEs. The energy 
functions (« feature functions) are physics-based ones, 
and the parameters are not optimized to fit some training 
data, but obtained from physical experiments. Therefore, 
the FR models are more suitable for studying physical as- 
pects of protein folding and structure prediction, but less 
so for more general-purpose sequence analysis. Neverthe- 
less, almost all the theoretical foundations of the FR theory 
such as calculation of partition functions, marginal proba- 
bilities, mean-field approximations, but except for parame- 
ter learning, are shared by profile CRFs. After all, the both 
models are extensions of the ID Ising model. 

The analogy between ID Ising model and a more gen- 
eral sequence alignment problem was pointed out by 
Miyazawa 22 , which was further extended to the prob- 
lem of sequence-structure alignment with a mean-field 
approximation^! Later, Koike et al^ applied this analogy 
to compute partition functions and marginal probabilities 
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in protein structure comparison with the Bethe approxi- 
mation. By complementing the FR theory with these tech- 
niques, the alignment algorithm can be made more general, 
and one such generalization is the profile CRF model. The 
improvements made by profile CRFs on the FR theory are 
thus clear: more general treatment of model states, pos- 
sible insertions and deletions at any sites, and parameter 
learning based on MSA. 

Profile HMMs, being a class of generative models, need 
to calculate the joint probability of alignment P(x, y) 
while profile CRFs, being a class of discriminative model, 
directly calculates the conditional probability P(y|x). In 
special cases, with the definition of the conditional prob- 
ability P(y|x) = P(x,y)/P(x) in mind, we may re- 
gard Z(x) as P(x) and exp[— P(y, x)] as P(x, y). More 
specifically, if we define only the following feature func- 
tions (and no others) with appropriate values for external 
fields, we can construct a CRF that is equivalent to a given 
HMM: 

1 . Define singlet feature functions Sg k for matching and 
insertion states as in Eq. ( [Oi l. For deletion states, just 
define a constant SFF (always equal to 1). 

2. Define sequence-independent feature functions t^ fc S[ 
for each transitions as in Eq. ( fl~8b . 

3. Set the singlet external fields as \§ k = log qs h (R) 
(qs k (P): the emission probability of the HMM). 

4. Set the doublet external fields as s = logps,.^, 
(Ps k ,s t '■ transition probability of the HMM). 

However, this equivalence breaks down as soon as we in- 
corporate other feature functions into profile CRFs since 
the Boltzmann factor exp[— P(y, x)] may no longer sat- 
isfy a condition of probability measure (i.e., normalization 
to 1). Thus, HMMs are a very special class of CRFs. 

In summary, we have presented the profile CRF model. 
This model is flexible enough to accommodate almost any 
features of target sequences including PSSM, local se- 
quence patterns, and even long-range correlations. It can 
also incorporate various features of a modeled protein fam- 
ily such as local structures and long-range pairwise in- 
teractions. Although concrete implementations are yet to 
be done, we expect this model to be a useful alternative 
to conventional methods for analyzing and understanding 
protein sequences and structures. 
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